Support 24 more languages, including JSON, Kotlin, XML, YAML etc... #33
This is great @yoeo! I did notice some decrease in confidence for Java. The following snippet used to have over 60% confidence:

    public class PositiveNegative {
        public static void main(String[] args) {
            double number = 12.3;
            // true if number is less than 0
            if (number < 0.0)
                System.out.println(number + " is a negative number.");
            // true if number is greater than 0
            else if ( number > 0.0)
                System.out.println(number + " is a positive number.");
            // if both test expression is evaluated to false
            else
                System.out.println(number + " is 0.");
        }
    }

But using this branch, it's down to 20% confidence that it's Java. My guess is that the introduction of Groovy hurt the confidence?
Nice catch @TylerLeonhardt. This model is still a "work in progress" and I hope that training it with more examples and for a longer time will help improve its predictions.
@yoeo the JSON and YAML predictions were great, btw. Such a game changer :) I hope to have this in a VS Code Insiders release either this week or next. Exciting times!
I investigated the confidence drop that you noticed. For example, here are box plots of the probabilities that I got by testing 5k Java files:

We can see that the addition of Groovy and Dart hurts Java detection confidence, but almost all the time the files are still correctly detected as Java files. The probability plots for all the languages are available here:
@yoeo this is amazing work! I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence. For example: 30% Java and <1% everything else means it's probably Java. I don't know if 30%/1% is the best pair of numbers... but I'll give it a go. I'm open to suggestions from you since you're the expert 😃
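For illustration, the relative-confidence idea above could be sketched roughly as follows. This is only a sketch under assumptions: `pick_language`, `min_top`, and `max_runner_up` are hypothetical names, the input is assumed to be a list of `(language, probability)` pairs with probabilities between 0 and 1, and the 30%/1% defaults are just the pair of numbers floated above, not tuned values.

```python
from typing import List, Optional, Tuple


def pick_language(
    probabilities: List[Tuple[str, float]],
    min_top: float = 0.30,        # hypothetical floor for the best guess
    max_runner_up: float = 0.01,  # hypothetical ceiling for every other guess
) -> Optional[str]:
    """Return the top language only when it clearly dominates the rest."""
    if not probabilities:
        return None

    # Rank the candidates from most to least probable.
    ranked = sorted(probabilities, key=lambda item: item[1], reverse=True)
    top_name, top_prob = ranked[0]

    # If the runner-up is below the ceiling, every other language is too.
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0

    if top_prob >= min_top and runner_up_prob < max_runner_up:
        return top_name
    return None


print(pick_language([("Java", 0.30), ("Groovy", 0.005)]))
print(pick_language([("Java", 0.4163), ("Groovy", 0.2483), ("C#", 0.0617)]))
```

With these defaults the first call accepts Java, while the second (using the Java/Groovy/C# probabilities reported further down in this thread) returns `None` because Groovy is far above the 1% ceiling, so a rule of this shape would probably need a looser runner-up limit or a ratio-based comparison.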
Hi @TylerLeonhardt. The model is now fully trained. Its overall accuracy is pretty good at ~93.5% (the original model's accuracy was ~93.8%).

    echo "public class PositiveNegative {
    ....
    }" | guesslang --probabilities

    Language name       Probability
    Java                41.63%
    Groovy              24.83%
    C#                   6.17%
    ...

I'm pretty happy with these results and I'll merge this PR after updating the documentation.
You're perfectly right I think. Lines 160 to 168 in cbc441d
And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule 🙂 Thanks.
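The referenced lines 160 to 168 are not reproduced in this thread, so the following is only a guess at how the 68-95-99.7 rule could be turned into a relative reliability check: the top probability is accepted when it sits more than two standard deviations above the mean of all probabilities. The names `is_reliable` and `num_std` are hypothetical and may not match the actual code.

```python
from statistics import mean, pstdev
from typing import List


def is_reliable(probabilities: List[float], num_std: float = 2.0) -> bool:
    """Check whether the top probability is an outlier among all the others.

    Assumption: "reliable" means the maximum is more than `num_std`
    population standard deviations above the mean of the probabilities.
    """
    if not probabilities:
        return False
    threshold = mean(probabilities) + num_std * pstdev(probabilities)
    return max(probabilities) > threshold


# One language clearly ahead of nine others: treated as reliable.
print(is_reliable([0.91] + [0.01] * 9))

# Ten equally likely languages: nothing stands out, so not reliable.
print(is_reliable([0.1] * 10))
```

A check of this shape adapts to the number of languages: adding new languages spreads the probability mass, which lowers the mean and changes the threshold, instead of relying on a fixed cutoff such as 60%.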
…w newer version don't work on Python 3.6
😁 interesting! Thanks for sharing. I think I'll try to make sure my solution aligns with that and with what you're already doing. Excited to see this change go in!
Support the following languages:
Prediction accuracy is 92.59%, but the training and test datasets were not well balanced due to a lack of files for some languages.
And there were errors in the Pascal dataset.